Action Search: Spotting Actions in Videos and Its Application to Temporal Action Localization
State-of-the-art temporal action detectors inefficiently search the entire
video for specific actions. Despite the encouraging progress these methods
achieve, it is crucial to design automated approaches that only explore parts
of the video which are the most relevant to the actions being searched for. To
address this need, we propose the new problem of action spotting in video,
which we define as finding a specific action in a video while observing a small
portion of that video. Inspired by the observation that humans are extremely
efficient and accurate in spotting and finding action instances in video, we
propose Action Search, a novel Recurrent Neural Network approach that mimics
the way humans spot actions. Moreover, to address the absence of data recording
the behavior of human annotators, we put forward the Human Searches dataset,
which compiles the search sequences employed by human annotators spotting
actions in the AVA and THUMOS14 datasets. We consider temporal action
localization as an application of the action spotting problem. Experiments on
the THUMOS14 dataset reveal that our model is not only able to explore the
video efficiently (observing on average 17.3% of the video) but it also
accurately finds human activities with 30.8% mAP.
Comment: Accepted to ECCV 2018.
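To make the spotting loop concrete, below is a minimal, hypothetical sketch in PyTorch (all names, dimensions, and the stopping rule are assumptions, not the authors' released code): a recurrent cell observes the feature at the current temporal position and proposes the next position to inspect, so only a fraction of the video is ever seen.

    import torch
    import torch.nn as nn

    class ActionSpotter(nn.Module):
        """Sketch of a search policy: observe a snippet, decide where to look next."""
        def __init__(self, feat_dim=512, hidden_dim=1024):
            super().__init__()
            # Input: the observed snippet feature plus the current position in [0, 1].
            self.lstm = nn.LSTMCell(feat_dim + 1, hidden_dim)
            self.next_pos = nn.Linear(hidden_dim, 1)  # where to look next, in [0, 1]

        def forward(self, video_feats, start_pos=0.5, max_steps=10):
            # video_feats: (T, feat_dim) precomputed per-snippet features.
            T = video_feats.shape[0]
            h = torch.zeros(1, self.lstm.hidden_size)
            c = torch.zeros(1, self.lstm.hidden_size)
            pos = torch.tensor([[start_pos]])
            visited = []
            for _ in range(max_steps):
                idx = int(pos.item() * (T - 1))
                visited.append(idx)
                obs = torch.cat([video_feats[idx].unsqueeze(0), pos], dim=1)
                h, c = self.lstm(obs, (h, c))
                pos = torch.sigmoid(self.next_pos(h))  # predicted next location
            return visited  # indices inspected: a small fraction of the T snippets

    feats = torch.randn(300, 512)    # a 300-snippet video with 512-d features
    print(ActionSpotter()(feats))    # 10 inspected indices out of 300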
Few-Shot Transformation of Common Actions into Time and Space
This paper introduces the task of few-shot common action localization in time
and space. Given a few trimmed support videos containing the same but unknown
action, we strive for spatio-temporal localization of that action in a long
untrimmed query video. We do not require any class labels, interval bounds, or
bounding boxes. To address this challenging task, we introduce a novel few-shot
transformer architecture with a dedicated encoder-decoder structure optimized
for joint commonality learning and localization prediction, without the need
for proposals. Experiments on our reorganizations of the AVA and UCF101-24
datasets show the effectiveness of our approach for few-shot common action
localization, even when the support videos are noisy. Although our approach is
not specifically designed for common localization in time only, it also compares
favorably against the few-shot and one-shot state-of-the-art in this setting.
Lastly, we demonstrate that the few-shot transformer is easily extended to
common action localization per pixel.
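As a rough illustration of the proposal-free encoder-decoder idea, here is a hypothetical, deliberately simplified PyTorch sketch (module names, heads, and dimensions are assumptions): the encoder digests the support videos, each query frame attends to them in the decoder, and the resulting per-frame boxes and scores can be linked over time into an action tube.

    import torch
    import torch.nn as nn

    class FewShotLocalizer(nn.Module):
        def __init__(self, d_model=256):
            super().__init__()
            self.transformer = nn.Transformer(d_model=d_model, nhead=8, batch_first=True)
            self.box_head = nn.Linear(d_model, 4)    # normalized (x, y, w, h) per frame
            self.score_head = nn.Linear(d_model, 1)  # per-frame foreground confidence

        def forward(self, support_feats, query_feats):
            # support_feats: (S, d_model) tokens from the few trimmed support videos
            # query_feats:   (T, d_model) per-frame features of the untrimmed query video
            src = support_feats.unsqueeze(0)                     # encoder input
            tgt = query_feats.unsqueeze(0)                       # decoder input
            dec = self.transformer(src, tgt)                     # (1, T, d_model)
            boxes = self.box_head(dec).sigmoid().squeeze(0)      # (T, 4)
            scores = self.score_head(dec).sigmoid().squeeze(0)   # (T, 1)
            return boxes, scores  # thresholding + temporal linking yields a tube

    support = torch.randn(24, 256)   # e.g. 3 support videos x 8 sampled frames
    query = torch.randn(200, 256)    # 200 frames of the untrimmed query video
    boxes, scores = FewShotLocalizer()(support, query)
    print(boxes.shape, scores.shape)  # torch.Size([200, 4]) torch.Size([200, 1])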
Object priors for classifying and localizing unseen actions
This work strives for the classification and localization of human actions in
videos, without the need for any labeled video training examples. Where
existing work relies on transferring global attribute or object information
from seen to unseen action videos, we seek to classify and spatio-temporally
localize unseen actions in videos from image-based object information only. We
propose three spatial object priors, which encode local person and object
detectors along with their spatial relations. On top of these, we introduce three
semantic object priors, which extend semantic matching through word embeddings
with three simple functions that tackle semantic ambiguity, object
discrimination, and object naming. A video embedding combines the spatial and
semantic object priors. It enables us to introduce a new video retrieval task
that retrieves action tubes in video collections based on user-specified
objects, spatial relations, and object size. Experimental evaluation on five
action datasets shows the importance of spatial and semantic object priors for
unseen actions. We find that persons and objects have preferred spatial
relations that benefit unseen action localization, while using multiple
languages and simple object filtering directly improves semantic matching,
leading to state-of-the-art results for both unseen action classification and
localization.
Comment: Accepted to IJCV.
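The semantic matching step can be illustrated with a small, hypothetical sketch (toy vectors stand in for real word embeddings such as word2vec or FastText): each detected object votes for an unseen action in proportion to its detector confidence and the similarity of its name to the action name.

    import numpy as np

    # Hypothetical word embeddings; a real system would load pretrained vectors.
    emb = {
        "kayaking": np.array([0.9, 0.1, 0.0]),
        "kayak":    np.array([0.8, 0.2, 0.1]),
        "paddle":   np.array([0.7, 0.3, 0.0]),
        "dog":      np.array([0.0, 0.1, 0.9]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def action_score(action, detections):
        # detections: list of (object_name, detector_confidence) found in the video.
        # Score the unseen action by how strongly detected objects relate to its name.
        return sum(conf * cosine(emb[action], emb[obj]) for obj, conf in detections)

    detections = [("kayak", 0.9), ("paddle", 0.7), ("dog", 0.2)]
    print(action_score("kayaking", detections))  # high: detected objects fit the action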